ggplot2

ggplot2 is the most elegant and aesthetically pleasing graphics framework available in R. The way you make plots in ggplot2 is very different from base graphics, which makes the learning curve steep. That said, it’s totally worth it.

# Within each document, load the ggplot2 package so R knows you will be using functions/data/etc. from inside that package
library(ggplot2)
library(tidyverse)
## ── Attaching packages ──────────────────────── tidyverse 1.3.0 ──
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ✔ purrr   0.3.3
## ── Conflicts ─────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)

It’s essential that you properly organize your data into a data frame before you start with ggplot2. This is why we spent the last week or two focusing on ways to transform and wrangle data into different formats.

Once you have your data ready to go, you gradually add bits and pieces to it to create a plot. Plots are built up in layers, with the typical ordering being

  1. Plot the data
  2. Overlay a summary
  3. Add metadata and annotation
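
As a rough sketch of that ordering (the variables displ and hwy and the fitted straight line are just choices made for illustration here, not the only way to summarize the data), a plot built in three layers might look like this:

```r
library(ggplot2)

# Each `+` adds one more piece on top of what came before.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +                                            # 1. plot the data
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # 2. overlay a summary (a fitted line)
  labs(title = "Engine size vs. highway MPG",               # 3. add metadata and annotation
       x = "Engine displacement (litres)",
       y = "Highway MPG")

p  # printing the object draws the plot
```

Storing the plot in an object like p is optional, but it makes it easy to keep adding layers later.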

Basics

We will be working with the dataset mpg, which comes with ggplot2. It contains fuel economy data for 38 popular car models from 1999 to 2008.

data(mpg)

The Setup

# General form: ggplot(dataframe, aes(x=xvariable, y=yvariable))
# aes stands for aesthetics

# initial ggplot
ggplot(mpg, aes(x=cty, y=hwy))

A blank ggplot is drawn. Even though the x and y are specified, there are no points or lines in it. This is because ggplot doesn’t assume that you meant a scatterplot or a line chart to be drawn. I have only told ggplot what dataset to use and what columns should be used for the X and Y axes. I haven’t explicitly asked it to draw any points.
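
One way to convince yourself of this is to store the plot in an object and count its layers. This is only a small sketch; the object names p and p_points are made up for the example.

```r
library(ggplot2)

# Data and aesthetic mappings only: nothing has been asked to be drawn yet
p <- ggplot(mpg, aes(x = cty, y = hwy))
length(p$layers)         # 0 layers, so the plot is blank

# Adding a geom creates the first layer
p_points <- p + geom_point()
length(p_points$layers)  # 1 layer
```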

Plotting Points

The basics:

ggplot(mpg, aes(x=cty, y=hwy)) +
  geom_point()

## OR, pipe the data frame into ggplot() (layers are still added with +, not %>%)

mpg %>% 
  ggplot(aes(x=cty, y=hwy)) +
  geom_point()

To customize colors, plotting characters, size:

ggplot(mpg, aes(x=cty, y=hwy)) + 
  geom_point(col="steelblue", pch=18, size=2)

A list of possible pch values
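
If the chart of pch values is not handy, a quick sketch like the one below draws the 26 standard plotting characters (0 through 25) with their numbers. The data frame shapes and its grid layout are invented for this example.

```r
library(ggplot2)

# One row per plotting character, arranged on a 6-column grid
shapes <- data.frame(pch = 0:25,
                     x = 0:25 %% 6,
                     y = -(0:25 %/% 6))

p_shapes <- ggplot(shapes, aes(x, y)) +
  geom_point(aes(shape = pch), size = 4, fill = "steelblue") +  # fill only affects pch 21-25
  geom_text(aes(label = pch), nudge_y = 0.35, size = 3) +       # print the pch number above each symbol
  scale_shape_identity() +                                      # use the pch values as-is
  theme_void()

p_shapes
```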

Adding Layers

Every new piece of a plot is added with +. Below, we start with the scatterplot made by the geom layer geom_point and then add a title, subtitle, axis labels, and caption using labs.

ggplot(mpg, aes(x=cty, y=hwy))  +
  geom_point(col="steelblue", pch=18, size=2) +
  labs(title="Scatterplot", subtitle="City MPG vs. Highway MPG", y="Highway MPG", x="City MPG", caption="source: mpg") 

# Other pieces that can be added with + :
#   + xlab("City MPG")
#   + ylab("Highway MPG")
#   + xlim(c(0,30))
#   + ylim(c(0,40))

If you add the xlim() and ylim() lines, ggplot2 gives a warning because we have adjusted the x and y axes to exclude some points.
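
To see where that warning comes from, here is a small sketch comparing two ways of limiting the axes. The object names base, clipped, and zoomed are made up for the example; coord_cartesian() is a standard ggplot2 function that zooms in without dropping any data.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# xlim()/ylim() throw away rows outside the limits, which triggers the warning
clipped <- base + xlim(c(0, 30)) + ylim(c(0, 40))

# coord_cartesian() only zooms the view; every row is kept, so no warning
zoomed <- base + coord_cartesian(xlim = c(0, 30), ylim = c(0, 40))

# How many cars fall outside the limits used above?
sum(mpg$cty > 30 | mpg$hwy > 40)
```

The difference matters most once you overlay summaries such as smoothers, since those are computed only from the rows that remain.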

Colors by Group

gg <- ggplot(mpg, aes(x=cty, y=hwy))  +
  geom_point(aes(col=class), pch=18, size=2)  +
  labs(title="Scatterplot", subtitle="City MPG vs. Highway MPG", y="Highway MPG", x="City MPG", caption="source: mpg") 

gg

As an added benefit, the legend is added automatically. If needed, it can be removed by setting legend.position to "none" from within a theme() function.

gg + theme(legend.position="none")  # remove legend

You can also change the color palette entirely.

gg + scale_colour_brewer(palette = "Spectral")  # change color palette

More of such palettes can be found in the RColorBrewer package

RColorBrewer palettes

You can also build your own color palettes using the built-in colors in R or by using HEX codes (i.e., #RRGGBB).
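
For example, a hand-built palette could mix built-in color names and HEX codes, one per vehicle class. The particular colors below and the names class_cols and p_custom are arbitrary choices for illustration.

```r
library(ggplot2)

# One color per level of class; names must match the class values in mpg
class_cols <- c("2seater"    = "black",
                "compact"    = "#1B9E77",
                "midsize"    = "#D95F02",
                "minivan"    = "#7570B3",
                "pickup"     = "firebrick",
                "subcompact" = "goldenrod",
                "suv"        = "steelblue")

p_custom <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(col = class), pch = 18, size = 2) +
  scale_colour_manual(values = class_cols)   # apply the custom palette

p_custom
```

Naming the vector means each class keeps its color even if the order of the levels changes.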

R Built In Colors

We will spend more time later in the course discussing best practices for color choices, but for now keep in mind:

  • use intuitive/meaningful colors, if possible
  • make sure to use colors with high contrast (exception: avoid pairing red and green, if possible)

Adding Text, Labels, and Annotation

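ggplot(mpg, aes(x=cty, y=hwy, label=model)) +  # refer to columns by name inside aes(), not with mpg$
  geom_jitter(aes(col=class), pch=18, size=2) +
  geom_text(size=1, hjust=0, vjust=0)  # labels sit at the original (unjittered) positions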

Using Themes

Themes can be a useful way to “style” an entire graph at once. Common themes are theme_classic(), theme_dark(), theme_bw(), and theme_grey().

gg + theme_bw()

The ggthemes package contains lots of additional themes, including theme_wsj() (Wall Street Journal), theme_economist() (The Economist), theme_fivethirtyeight() (Five Thirty Eight), etc.

library(ggthemes) #make sure you have run install.packages("ggthemes") on your computer at some point
gg + theme_wsj() + scale_color_wsj()
## Warning: This manual palette can handle a maximum of 6 values. You have
## supplied 7.
## Warning: Removed 62 rows containing missing values (geom_point).

More Plot Types

Histograms

Histograms should be used for one continuous variable.

ggplot(mpg, aes(cty)) + 
  geom_histogram() +   # default binning: 30 bins
  labs(title="Histogram with Auto Binning", 
       subtitle="City MPG") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(cty)) +
  geom_histogram(binwidth=2) +   # change binwidth
  labs(title="Histogram with Fixed Bin Width", 
       subtitle="City MPG") 

ggplot(mpg, aes(cty)) +
  geom_histogram(binwidth=2) +   # same bins as above
  labs(title="Histogram with Log-Transformed X Axis", 
       subtitle="City MPG") +
  coord_trans(x="log10")         # display the x axis on a log10 scale

Boxplots

Boxplots should be used for one continuous variable. Side-by-side boxplots can be good for comparing a numerical variable across the different levels (categories) of a categorical variable.

ggplot(mpg, aes(class, cty))  +
  geom_boxplot( fill="plum", outlier.size=1) +
  labs(title="Box plot", 
         subtitle="City Mileage grouped by Class of vehicle",
         caption="Source: mpg",
         x="Class of Vehicle",
         y="City Mileage") 

mpg %>% 
  mutate(class = reorder(class, cty, median )) %>% 
  ggplot(aes(class, cty))  +
  geom_boxplot( fill="plum", outlier.size=1) +
  labs(title="Box plot", 
         subtitle="City Mileage grouped by Class of vehicle",
         caption="Source: mpg",
         x="Class of Vehicle",
         y="City Mileage") 

Barplots

Barplots should be used for one or two categorical variables.

ggplot(mpg, aes(manufacturer)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle=90)) +
  labs(title="Barplot on One Categorical Variable", 
       subtitle="Manufacturer across Vehicle Classes") 

## OR 
ggplot(mpg, aes(x = manufacturer)) +
  geom_bar(fill="blue") + 
  #+ theme(axis.text.x = element_text(angle=90)) 
  labs(title="Barplot on One Categorical Variable", 
       subtitle="Manufacturer across Vehicle Classes") + 
  coord_flip() 

# Fill the bars by a second categorical variable and color them with a brewer palette
ggplot(mpg, aes(manufacturer)) + 
  geom_bar(aes(fill=class)) +
  labs(title="Barplot on Two Categorical Variables", 
       subtitle="Manufacturer across Vehicle Classes") +
  theme_classic() +
  theme(axis.text.x = element_text(angle=90)) +
  scale_fill_brewer(palette = "Spectral")

There are so many different ways to modify the themes: the legend, where the axis ticks go, the background colors, the position of text, the font, etc. You can get the full scope of all the options by typing ?theme into the console.
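
As a small taste of what theme() can do, the sketch below changes a handful of elements at once. The specific settings and the name p_theme are only examples, not recommendations.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(col = class)) +
  labs(title = "City MPG vs. Highway MPG") +
  theme(
    legend.position  = "bottom",                               # move the legend below the plot
    axis.ticks       = element_blank(),                        # hide the axis tick marks
    panel.background = element_rect(fill = "grey95"),          # lighten the plot background
    plot.title       = element_text(hjust = 0.5, face = "bold") # center and bold the title
  )

p_theme
```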

Colors (Hidden)

Set ggplot color manually:

  • scale_fill_manual() for box plot, bar plot, violin plot, dot plot, etc.
  • scale_color_manual() or scale_colour_manual() for lines and points

Use colorbrewer palettes:

  • scale_fill_brewer() for box plot, bar plot, violin plot, dot plot, etc.
  • scale_color_brewer() or scale_colour_brewer() for lines and points

Use grey color scales:

  • scale_fill_grey() for box plot, bar plot, violin plot, dot plot, etc.
  • scale_color_grey() or scale_colour_grey() for points, lines, etc.

#custom.col <- c("#FFDB6D", "#C4961A", "#F4EDCA",  "#D16103", "#C3D7A4", "#52854C", "#4E84C4", "#293352")
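
Here is one possible way to put a custom vector like that to work. This is a sketch only: the boxplot by class and the names p_fill and p_grey are chosen just for illustration.

```r
library(ggplot2)

custom.col <- c("#FFDB6D", "#C4961A", "#F4EDCA", "#D16103",
                "#C3D7A4", "#52854C", "#4E84C4", "#293352")

# Manual fill colors: the vector needs at least as many colors as there are classes
p_fill <- ggplot(mpg, aes(class, cty, fill = class)) +
  geom_boxplot() +
  scale_fill_manual(values = custom.col)

# Same plot using a grey scale instead
p_grey <- ggplot(mpg, aes(class, cty, fill = class)) +
  geom_boxplot() +
  scale_fill_grey()

p_fill
p_grey
```

Note that fill is mapped inside aes() here; the fill scales have no visible effect unless some variable is mapped to fill.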

Line Plots

gapminder <- read.csv("https://ebmwhite.github.io/MATH0216/activities/gapminder.csv")
ggplot(gapminder, aes(x=year, y=lifeExp, group=country)) +
  geom_line()

gapminder %>%
    group_by(continent, year) %>%
    summarise(lifeExp=median(lifeExp)) %>%
    ggplot(aes(x=year, y=lifeExp, color=continent)) +
     geom_line(size=1) + 
     geom_point(size=1.5)

Gapminder

gapminder %>% 
  filter(year==1952) %>% 
  ggplot( aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop)) +
  scale_x_log10()

Plotly

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent, frame = year)) +
  geom_point(aes(size = pop)) +
  #geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10()
ggplotly(gg) %>% 
  highlight("plotly_hover")

Activity

Datasets: mpg, NYCairbnb2019.csv, gapminder

#library(openintro)
#cars

#library(tidyverse)
data(diamonds)
force(diamonds)
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
#NYCairbnb2019.csv

surveys <- read.csv("https://ebmwhite.github.io/MATH0216/data/sample.csv")
gapminder <- read.csv("https://ebmwhite.github.io/MATH0216/activities/gapminder.csv")
UNCdata <- read.csv("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv")

Quick Reference

Here are some resources that may be useful quick reference guides for ggplot2: